Software Thread Integration for Converting TLP to ILP on VLIW/EPIC Architectures
Author: Won So
Abstract
SO, WON. Software Thread Integration for Converting TLP to ILP on VLIW/EPIC Architectures. (Under the direction of Alexander G. Dean.) Multimedia applications are pervasive in modern systems. They generally require significantly higher performance than earlier embedded workloads, which has driven digital signal processor makers to adopt high-performance architectures such as VLIW (Very Long Instruction Word) and EPIC (Explicitly Parallel Instruction Computing). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, typical utilization levels for compiler-generated VLIW/EPIC code range from one-eighth to one-half because a single instruction stream has limited ILP. Software Thread Integration (STI) is a software technique that interleaves multiple threads at the machine instruction level. Integrating threads increases the number of independent instructions, allowing the compiler to generate a more efficient instruction schedule and hence faster runtime performance. We have developed techniques that use STI to convert thread-level parallelism (TLP) to ILP on VLIW/EPIC architectures. Focusing on the abundant procedure-level parallelism in multimedia applications, we gather work in the application and integrate parallel procedure calls, which can be seen as threads. We rely on the programmer, rather than the compiler, to identify parallel procedures. Our methods extend whole-program optimization by expanding the scope of the compiler through software thread integration and procedure cloning. STI is effectively a superset of loop jamming, as it allows a larger variety of threads to be jammed together. This thesis proposes a methodology for integrating multiple threads in multimedia applications and introduces the concept of a ‘Smart RTOS’ as an execution model for using integrated threads efficiently in embedded systems.
We demonstrate our technique by integrating three procedures from a JPEG application at the C source code level, compiling with four compilers for the Itanium EPIC architecture, and measuring performance with the on-chip performance measurement units. Experimental results show procedure speedups of up to 18% and program speedups of up to 11%. Detailed performance analysis shows the primary bottleneck to be the Itanium’s 16 KB instruction cache, which has limited room for the code expansion introduced by STI.
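To make the idea of integrating two parallel procedure calls concrete, the following is a minimal C sketch under stated assumptions. It is not taken from the thesis: the procedures `scale_block` and `clamp_block` and the integrated clone `scale_clamp_integrated` are hypothetical stand-ins, and the integration shown is the simple loop-jamming case, which the abstract describes as a subset of what STI covers. The sketch also assumes that the buffers of the two calls do not overlap, which is expressed with `restrict`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical procedure A: multiply every sample by a constant factor. */
void scale_block(const int16_t *in, int16_t *out, size_t n, int16_t factor)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int16_t)(in[i] * factor);
}

/* Hypothetical procedure B: clamp every sample to the range [lo, hi]. */
void clamp_block(const int16_t *in, int16_t *out, size_t n,
                 int16_t lo, int16_t hi)
{
    for (size_t i = 0; i < n; i++) {
        int16_t v = in[i];
        out[i] = v < lo ? lo : (v > hi ? hi : v);
    }
}

/*
 * Hypothetical integrated clone: does the work of one call to A and one
 * independent call to B. The two loop bodies are jammed into a single loop,
 * so each iteration exposes instructions from both "threads" to the
 * compiler's scheduler. Assumption: the four buffers do not overlap.
 */
void scale_clamp_integrated(const int16_t *restrict a_in,
                            int16_t *restrict a_out, size_t a_n,
                            int16_t factor,
                            const int16_t *restrict b_in,
                            int16_t *restrict b_out, size_t b_n,
                            int16_t lo, int16_t hi)
{
    size_t common = a_n < b_n ? a_n : b_n;
    size_t i;

    /* Integrated region: both procedures advance together. */
    for (i = 0; i < common; i++) {
        int16_t v = b_in[i];
        a_out[i] = (int16_t)(a_in[i] * factor);
        b_out[i] = v < lo ? lo : (v > hi ? hi : v);
    }

    /* Clean-up loops: finish whichever procedure has iterations left. */
    for (size_t j = i; j < a_n; j++)
        a_out[j] = (int16_t)(a_in[j] * factor);
    for (size_t j = i; j < b_n; j++) {
        int16_t v = b_in[j];
        b_out[j] = v < lo ? lo : (v > hi ? hi : v);
    }
}
```

A caller that previously issued `scale_block(...)` followed by an independent `clamp_block(...)` would instead call the integrated clone once. Whether this pays off depends on the compiler and on code size, since the integrated clone and its clean-up loops enlarge the instruction footprint.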
Similar resources
Simultaneous Multithreading
Current research in processor technology and computer architecture is motivated primarily by the need for greater performance. In this context, it is well understood that the performance gain from improving the memory system alone is limited, and that system-level integration (such as supporting graphics/sound on chip) can lead only to marginal performance benefits. The most significant gain c...
Exploiting Java Instruction/Thread Level Parallelism with Horizontal Multithreading
Java bytecodes can be executed in one of three ways: a Java interpreter running on a particular machine interprets the bytecodes; a Just-In-Time (JIT) compiler translates the bytecodes to the native primitives of the particular machine, which then executes the translated code; or a Java processor executes the bytecodes directly. The first two methods require no special hardware support fo...
Dual-thread Weld: A Technique for Latency Tolerance in Horizontal Architectures
This paper presents the dual-thread Weld architecture for VLIW/EPIC processors. The dual-thread Weld model supports one main thread and one speculative thread running simultaneously in a VLIW/EPIC processor with a register file and a fetch unit per thread. This paper analyzes the cost-performance impact of the dual-thread Weld model, which includes analysis of migrating the disambiguation hardware ...
Loop Transformation Techniques To Aid In Loop Unrolling and Multithreading
In modern computer systems, loops present many opportunities for increasing instruction-level and thread-level parallelism. Loop unrolling is a technique used to obtain greater ILP, while independent loop iterations are assigned to different threads to obtain greater TLP. However, techniques are needed to avoid unnecessary checks to assure that only the correct number of iterations are...
Is There Exploitable Thread-Level Parallelism in General-Purpose Application Programs?
Most of the thread-level parallelism (TLP) successfully exploited so far has come from scientific application programs, in particular floating-point programs. General-purpose applications, especially those written in C or C++, such as the benchmarks in SPECint2000, have primarily exploited only instruction-level parallelism (ILP). A lot of research has been done recently ...